Skip to content

Cherry-pick [2.7] Fix recipe API bug list and harden recipe behavior (#4228)#4293

Open
YuanTingHsieh wants to merge 2 commits intoNVIDIA:mainfrom
YuanTingHsieh:cherry-pick-4228
Open

Cherry-pick [2.7] Fix recipe API bug list and harden recipe behavior (#4228)#4293
YuanTingHsieh wants to merge 2 commits intoNVIDIA:mainfrom
YuanTingHsieh:cherry-pick-4228

Conversation

@YuanTingHsieh
Copy link
Collaborator

Issues addressed

This PR addresses the reported recipe bug list:

  • BUG-8 (Critical): CyclicRecipe model=None + initial_ckpt created phantom persistor reference and runtime failure.
  • BUG-1 (Critical): CyclicRecipe dict model config caused TypeError.
  • BUG-7 (Critical): base CyclicRecipe silently ignored initial_ckpt.
  • BUG-9 (High): FedAvgRecipe per_site_config[\"server\"] caused invalid target handling/crash.
  • BUG-4 (High): Recipe.execute/export mutated job state and made recipe reuse unsafe.
  • BUG-2 (Medium): FedAvgRecipe per-site or fallback ignored falsy overrides.
  • BUG-3 (Medium): SimEnv with num_clients=0 and client list resolved num_threads=0.
  • BUG-5 (Low): base FedAvgRecipe accepted initial_ckpt + model=None but produced incomplete model source.
  • BUG-6 (Low): add_cross_site_evaluation idempotency guard depended only on transient _cse_added attribute.

Changes

  • Hardened base CyclicRecipe model/persistor setup:
    • fail-fast when a valid persistor cannot be configured;
    • support dict model config for supported frameworks;
  • apply/validate initial_ckpt for wrapper and framework-specific paths.
  • Hardened base FedAvgRecipe:
  • validate and reject reserved per_site_config targets (server, @ALL);
  • preserve falsy per-site override values via explicit is not None fallback;
    • fail-fast when no persistor and no model params are available.
  • Added shared framework persistor setup utility in recipe utils and wired both Cyclic/FedAvg paths to it.
  • Made Recipe.execute() and Recipe.export() reusable-safe by snapshot/restore of temporary execution params.
  • Fixed SimEnv default resolution for num_clients/num_threads when explicit client list is provided.
  • Hardened CSE idempotency with workflow-based detection in addition to _cse_added fast-path.
  • Updated impacted examples and exports:
    • hello-cyclic now passes min_clients explicitly;
    • hello-numpy uses transfer-type API;
    • recipe module exports include add_cross_site_evaluation.
  • Expanded recipe unit tests with targeted regressions for all above behaviors.

Reason for these changes

These updates enforce clear model-source contracts and remove silent/implicit fallback behavior that could generate invalid jobs or runtime crashes. The goal is deterministic recipe behavior, safe object reuse, and predictable per-site/CSE configuration semantics in 2.7.

Affected files

Core implementation:

  • nvflare/recipe/cyclic.py
  • nvflare/recipe/fedavg.py
  • nvflare/recipe/spec.py
  • nvflare/recipe/sim_env.py
  • nvflare/recipe/utils.py
  • nvflare/recipe/poc_env.py
  • nvflare/recipe/__init__.py

Framework wrappers / examples:

  • nvflare/app_opt/pt/recipes/cyclic.py
  • nvflare/app_opt/tf/recipes/cyclic.py
  • examples/hello-world/hello-cyclic/job.py
  • examples/hello-world/hello-numpy/job.py
  • examples/tutorials/job_recipe.ipynb

Tests:

  • tests/unit_test/recipe/cyclic_recipe_test.py
  • tests/unit_test/recipe/fedavg_recipe_test.py
  • tests/unit_test/recipe/spec_test.py
  • tests/unit_test/recipe/sim_env_test.py
  • tests/unit_test/recipe/utils_test.py

Test strategy

  • Targeted regression suites:
  • pytest -q tests/unit_test/recipe/cyclic_recipe_test.py tests/unit_test/recipe/fedavg_recipe_test.py tests/unit_test/recipe/spec_test.py
  • pytest -q tests/unit_test/recipe/sim_env_test.py tests/unit_test/recipe/utils_test.py
  • Full recipe suite:
    • pytest -q tests/unit_test/recipe
  • Result: recipe suite passed (183 passed, 13 skipped).

Made with Cursor

Fixes # .

Description

A few sentences describing the changes proposed in this pull request.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 11, 2026 01:30
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR is a comprehensive bug-fix and hardening pass for the NVFlare 2.7 Recipe API, addressing nine reported bugs ranging from phantom persistor references to silent per-site config override failures. The changes enforce deterministic model-source contracts, make execute/export reusable-safe via a snapshot/restore context manager, and fix SimEnv/PocEnv client-count resolution edge cases.

Key changes and observations:

  • Fail-fast persistor validation (CyclicRecipe, FedAvgRecipe): the old "default to 'persistor'" fallback is replaced by explicit ValueError, which is the right direction for deterministic behavior.
  • _temporary_exec_params context manager (spec.py): correctly restores additional_params after execute/export, though only additional_params is snapshotted — new apps/components added by process_env are still not reverted (documented in a prior thread).
  • setup_custom_persistor / extract_persistor_id helpers (utils.py): centralise dict-vs-string return normalisation, eliminating the result["persistor_id"] KeyError that motivated several bugs.
  • Per-site config reserved-target validation (fedavg.py): correctly blocks 'server' and '@ALL' as per_site_config keys before job construction.
  • SimEnv / PocEnv num_clients resolution: BUG-3 is cleanly fixed; the PocEnv None-guard prevents a TypeError when Pydantic does not coerce the field.
  • add_cross_site_evaluation idempotency: the workflow-scan fallback (_has_cross_site_eval_workflow) provides resilience when the _cse_added transient attribute is lost.
  • In nvflare/recipe/cyclic.py (line 227), setattr(self.model, "initial_ckpt", ckpt_path) mutates the caller's model wrapper in-place. If the same wrapper is passed to two recipes the second sees the injected checkpoint value.
  • In nvflare/app_opt/pt/recipes/fedavg.py (lines 172–175), the custom-persistor branch registers _pt_model_locator with job.to_server(..., id="locator") but only stores the result in comp_ids when it is a non-empty string. Locators implementing add_to_fed_job return a dict, so locator_id would be silently dropped from comp_ids, breaking any downstream add_cross_site_evaluation call.
  • Test suite additions are solid and cover all nine bugs; one assertion (launcher._script.startswith(" custom/")) tests private ScriptRunner internals in a brittle way.

Confidence Score: 3/5

  • Mostly safe to merge, but the silent locator_id drop in the PT FedAvg custom-persistor path and the in-place mutation of the caller's model wrapper should be addressed before merging to avoid hard-to-diagnose regressions in cross-site evaluation workflows.
  • The core bug fixes are sound and well-tested (183 passed, 13 skipped). However, two new issues were introduced: (1) the locator_id registration in nvflare/app_opt/pt/recipes/fedavg.py silently fails for locators with add_to_fed_job, which would break cross-site evaluation without a clear error; (2) the in-place setattr on the caller's model wrapper could cause hard-to-debug checkpoint contamination when wrappers are reused. Both issues are localized but could produce confusing silent failures in production.
  • Pay close attention to nvflare/app_opt/pt/recipes/fedavg.py (locator_id dict-return handling) and nvflare/recipe/cyclic.py (in-place model wrapper mutation).

Important Files Changed

Filename Overview
nvflare/recipe/cyclic.py Hardened persistor setup: fail-fast when no persistor can be configured, stores min_clients/num_rounds as instance attrs, and applies initial_ckpt to wrappers — but the setattr mutation on a caller-provided model wrapper object is a subtle side-effect concern.
nvflare/recipe/fedavg.py Added per_site_config validation (reserved target rejection), is-not-None fallback for falsy overrides, fail-fast for no persistor/model-params, and extracted setup_custom_persistor utility — clean improvements with good coverage.
nvflare/recipe/spec.py Added contextmanager-based snapshot/restore for additional_params making execute/export reusable-safe; extra=None guard improvement is correct; snapshot trigger uses Python truthiness which the previous thread already flagged.
nvflare/recipe/utils.py Added extract_persistor_id, resolve_initial_ckpt, setup_custom_persistor helpers and workflow-based CSE idempotency check; _collect_non_local_scripts renamed to public collect_non_local_scripts.
nvflare/app_opt/pt/recipes/fedavg.py Refactored to use shared setup_custom_persistor utility; custom-persistor path now explicitly updates comp_ids; locator registration in the custom-persistor branch has a type-narrowing gap where dict return values from job.to_server silently drop locator_id from comp_ids.
nvflare/app_opt/tf/recipes/fedavg.py Refactored to use shared utilities; custom-persistor early-return path does not update job.comp_ids["persistor_id"] (already flagged in previous thread); TFModel path uses extract_persistor_id correctly but still skips comp_ids update.
nvflare/app_opt/pt/recipes/cyclic.py Adds allow_numpy_conversion derived from server_expected_format, uses resolve_initial_ckpt, explicit model=None+ckpt guard; minor: extract_persistor_id applied consistently.
nvflare/app_opt/tf/recipes/cyclic.py Uses resolve_initial_ckpt and extract_persistor_id; unlike PT path there is no explicit model=None guard, but this is intentional since TF SavedModels can be loaded checkpoint-only.
nvflare/recipe/sim_env.py BUG-3 fix: resolves num_clients and num_threads from client list length when num_clients=0; straightforward and covered by new tests.
nvflare/recipe/poc_env.py Adds None-guard for num_clients in validator to prevent TypeError; renames _collect_non_local_scripts import.
tests/unit_test/recipe/fedavg_recipe_test.py Comprehensive new regression tests for all flagged bugs; one assertion checks a private attribute in a brittle way that may break with ScriptRunner refactors.
tests/unit_test/recipe/spec_test.py Adds full execute/export param-isolation tests; updated fixtures switch from NUMPY+ckpt to dict model config, consistent with new fail-fast behavior.
tests/unit_test/recipe/utils_test.py Good coverage for all new utility functions (extract_persistor_id, resolve_initial_ckpt, setup_custom_persistor, CSE idempotency, package exports).

Sequence Diagram

sequenceDiagram
    participant User
    participant Recipe
    participant Job as FedJob
    participant Env as ExecEnv

    User->>Recipe: execute(env, server_exec_params, client_exec_params)
    activate Recipe
    Recipe->>Recipe: _temporary_exec_params() [snapshot additional_params]
    Recipe->>Job: to_server(server_exec_params)
    Recipe->>Job: _add_to_client_apps(client_exec_params)
    Recipe->>Recipe: process_env(env)
    Recipe->>Env: deploy(job) → job_id
    Recipe->>Recipe: [finally] _restore_additional_params(snapshot)
    Recipe-->>User: Run(env, job_id)
    deactivate Recipe

    User->>Recipe: export(job_dir, server_exec_params, client_exec_params)
    activate Recipe
    Recipe->>Recipe: _temporary_exec_params() [snapshot additional_params]
    Recipe->>Job: to_server(server_exec_params)
    Recipe->>Job: _add_to_client_apps(client_exec_params)
    Recipe->>Job: export_job(job_dir)
    Recipe->>Recipe: [finally] _restore_additional_params(snapshot)
    deactivate Recipe
Loading

Last reviewed commit: 6f78c56

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cherry-picks a set of fixes into the 2.7 branch to harden recipe behavior around model/persistor configuration, initial_ckpt handling, per-site config semantics, and recipe reuse (execute/export), with corresponding regression tests and example updates.

Changes:

  • Hardened CyclicRecipe/FedAvgRecipe model-source setup (persistor wiring, initial_ckpt resolution, per-site config validation).
  • Made Recipe.execute() / Recipe.export() reusable-safe via snapshot/restore of temporary execution params.
  • Added/updated utilities and expanded unit tests + refreshed examples/notebook content.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
nvflare/recipe/utils.py Adds helpers for persistor id extraction, initial checkpoint resolution, and stronger CSE idempotency checks; renames script collection helper to public API.
nvflare/recipe/spec.py Adds temporary exec-param isolation to keep recipes reusable across execute() / export().
nvflare/recipe/sim_env.py Fixes num_clients/num_threads resolution when explicit clients list is provided.
nvflare/recipe/prod_env.py Updates to use renamed collect_non_local_scripts helper.
nvflare/recipe/poc_env.py Tightens validation for num_clients being None and updates script validation helper usage.
nvflare/recipe/fedavg.py Hardens per-site config semantics, fails fast when no model source, and centralizes custom persistor setup.
nvflare/recipe/cyclic.py Stores validated attributes (e.g., min_clients), enforces persistor availability, and hardens wrapper/ckpt behavior.
nvflare/recipe/__init__.py Exposes add_cross_site_evaluation from top-level nvflare.recipe.
nvflare/app_opt/pt/recipes/fedavg.py Refactors persistor setup to support custom persistors and consistent id extraction.
nvflare/app_opt/pt/recipes/cyclic.py Refactors persistor setup and checkpoint resolution; adds explicit PT ckpt-without-model rejection.
nvflare/app_opt/tf/recipes/fedavg.py Refactors persistor setup with shared helpers and checkpoint resolution.
nvflare/app_opt/tf/recipes/cyclic.py Refactors persistor setup and checkpoint resolution; uses shared id extraction.
tests/unit_test/recipe/utils_test.py Adds tests for new persistor helpers, package exports, and CSE idempotency behavior.
tests/unit_test/recipe/spec_test.py Updates for renamed script collector and adds execute/export param-isolation regressions.
tests/unit_test/recipe/sim_env_test.py Adds BUG-3 regression for client/thread derivation from explicit client list.
tests/unit_test/recipe/poc_env_test.py Adds regression ensuring num_clients=None fails with ValueError rather than TypeError.
tests/unit_test/recipe/fedavg_recipe_test.py Adds regressions for reserved per-site targets, falsy override preservation, and persistor/locator wiring.
tests/unit_test/recipe/cyclic_recipe_test.py Expands base cyclic coverage for wrapper/ckpt behavior and persistor-id extraction.
examples/hello-world/hello-numpy/job.py Switches to transfer-type API for update type selection.
examples/hello-world/hello-cyclic/job.py Passes min_clients explicitly.
examples/tutorials/job_recipe.ipynb Updates documentation to reflect relative/absolute checkpoint semantics and environment argument descriptions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

## Issues addressed
This PR addresses the reported recipe bug list:

- `BUG-8` (Critical): `CyclicRecipe` `model=None + initial_ckpt` created
phantom persistor reference and runtime failure.
- `BUG-1` (Critical): `CyclicRecipe` dict model config caused
`TypeError`.
- `BUG-7` (Critical): base `CyclicRecipe` silently ignored
`initial_ckpt`.
- `BUG-9` (High): `FedAvgRecipe` `per_site_config[\"server\"]` caused
invalid target handling/crash.
- `BUG-4` (High): `Recipe.execute/export` mutated job state and made
recipe reuse unsafe.
- `BUG-2` (Medium): `FedAvgRecipe` per-site `or` fallback ignored falsy
overrides.
- `BUG-3` (Medium): `SimEnv` with `num_clients=0` and client list
resolved `num_threads=0`.
- `BUG-5` (Low): base `FedAvgRecipe` accepted `initial_ckpt` +
`model=None` but produced incomplete model source.
- `BUG-6` (Low): `add_cross_site_evaluation` idempotency guard depended
only on transient `_cse_added` attribute.

## Changes
- Hardened base `CyclicRecipe` model/persistor setup:
  - fail-fast when a valid persistor cannot be configured;
  - support dict model config for supported frameworks;
- apply/validate `initial_ckpt` for wrapper and framework-specific
paths.
- Hardened base `FedAvgRecipe`:
- validate and reject reserved `per_site_config` targets (`server`,
`@ALL`);
- preserve falsy per-site override values via explicit `is not None`
fallback;
  - fail-fast when no persistor and no model params are available.
- Added shared framework persistor setup utility in recipe utils and
wired both Cyclic/FedAvg paths to it.
- Made `Recipe.execute()` and `Recipe.export()` reusable-safe by
snapshot/restore of temporary execution params.
- Fixed `SimEnv` default resolution for `num_clients`/`num_threads` when
explicit client list is provided.
- Hardened CSE idempotency with workflow-based detection in addition to
`_cse_added` fast-path.
- Updated impacted examples and exports:
  - `hello-cyclic` now passes `min_clients` explicitly;
  - `hello-numpy` uses transfer-type API;
  - recipe module exports include `add_cross_site_evaluation`.
- Expanded recipe unit tests with targeted regressions for all above
behaviors.

## Reason for these changes
These updates enforce clear model-source contracts and remove
silent/implicit fallback behavior that could generate invalid jobs or
runtime crashes. The goal is deterministic recipe behavior, safe object
reuse, and predictable per-site/CSE configuration semantics in 2.7.

## Affected files
Core implementation:
- `nvflare/recipe/cyclic.py`
- `nvflare/recipe/fedavg.py`
- `nvflare/recipe/spec.py`
- `nvflare/recipe/sim_env.py`
- `nvflare/recipe/utils.py`
- `nvflare/recipe/poc_env.py`
- `nvflare/recipe/__init__.py`

Framework wrappers / examples:
- `nvflare/app_opt/pt/recipes/cyclic.py`
- `nvflare/app_opt/tf/recipes/cyclic.py`
- `examples/hello-world/hello-cyclic/job.py`
- `examples/hello-world/hello-numpy/job.py`
- `examples/tutorials/job_recipe.ipynb`

Tests:
- `tests/unit_test/recipe/cyclic_recipe_test.py`
- `tests/unit_test/recipe/fedavg_recipe_test.py`
- `tests/unit_test/recipe/spec_test.py`
- `tests/unit_test/recipe/sim_env_test.py`
- `tests/unit_test/recipe/utils_test.py`

## Test strategy
- Targeted regression suites:
- `pytest -q tests/unit_test/recipe/cyclic_recipe_test.py
tests/unit_test/recipe/fedavg_recipe_test.py
tests/unit_test/recipe/spec_test.py`
- `pytest -q tests/unit_test/recipe/sim_env_test.py
tests/unit_test/recipe/utils_test.py`
- Full recipe suite:
  - `pytest -q tests/unit_test/recipe`
- Result: recipe suite passed (`183 passed, 13 skipped`).

Made with [Cursor](https://cursor.com)
@YuanTingHsieh
Copy link
Collaborator Author

this is a cherry-pick additional comments will be addressed in another PR

f"Conflicting checkpoint values: model wrapper has initial_ckpt={existing_ckpt}, "
f"but recipe initial_ckpt={ckpt_path}."
)
setattr(self.model, "initial_ckpt", ckpt_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-place mutation of a caller-provided model wrapper

setattr(self.model, "initial_ckpt", ckpt_path) modifies the wrapper object that the caller passed in. If the same PTModel / TFModel instance is passed to two different recipes (or to execute + export in succession), the second call will observe the initial_ckpt that was injected by the first, which may not be what the caller intended.

Consider either copying the wrapper before mutating it, or documenting clearly that the recipe takes ownership of the wrapper and callers must not reuse the same instance:

import copy
model_wrapper = copy.copy(self.model)
setattr(model_wrapper, "initial_ckpt", ckpt_path)
result = job.to_server(model_wrapper, id="persistor")
return extract_persistor_id(result)

A shallow copy is sufficient because only the initial_ckpt attribute is changed.

Comment on lines +172 to +175
if self._pt_model_locator is not None:
locator_id = job.to_server(self._pt_model_locator, id="locator")
if isinstance(locator_id, str) and locator_id:
job.comp_ids["locator_id"] = locator_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locator_id silently dropped when job.to_server returns a dict

job.to_server(self._pt_model_locator, id="locator") can return a dict when the locator implements add_to_fed_job (e.g., a future custom PTFileModelLocator subclass or any locator wrapper). The guard isinstance(locator_id, str) and locator_id would then be False, so comp_ids["locator_id"] is never populated.

add_cross_site_evaluation later reads recipe.job.comp_ids.get("locator_id", "") (via locator_config), and a missing key leads to the locator not being wired up for cross-site evaluation without any error being raised — a silent failure.

Applying extract_persistor_id (or an equivalent extract_component_id helper) here would make the handling consistent with how persistor_id is resolved elsewhere in the same method:

if self._pt_model_locator is not None:
    raw = job.to_server(self._pt_model_locator, id="locator")
    locator_id = raw if isinstance(raw, str) and raw else (
        raw.get("locator_id", "") if isinstance(raw, dict) else ""
    )
    if locator_id:
        job.comp_ids["locator_id"] = locator_id

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants